Sk refactoring by SvenKlaassen · Pull Request #394 · DoubleML/doubleml-for-py

SvenKlaassen · 2026-05-09T13:33:27Z

Summary

Introduce a new DoubleMLScalar / DoubleMLVector class hierarchy alongside the existing DoubleML API. The refactor delivers a cleaner, more testable design with explicit tuning, nuisance evaluation, and sensitivity analysis as first-class features. Two concrete scalar models (PLR, IRM) and one vector model (PLRVector) are ported, each backed by a test suite that checks numerical equivalence with the legacy classes (the scalar suites compare at rtol=1e-9).

Motivation

The legacy DoubleML base class conflates single-parameter estimation, multi-treatment orchestration, and inference into one large class. This makes it hard to:

Add new models without inheriting unrelated multi-parameter machinery
Swap between closed-form and numerical score solvers
Test nuisance behavior in isolation from causal inference
Evolve features (tuning, sensitivity) without touching the core fit loop

The new hierarchy separates these concerns via a layered design with explicit hooks.

New Class Hierarchy

DoubleMLBase (ABC)               # data + framework delegation (coef, se, summary, confint, bootstrap, sensitivity)
└── DoubleMLScalar (ABC)         # single-parameter orchestrator (fit, sample splitting, learners, predictions)
    ├── LinearScoreMixin         # closed-form: theta = -E[psi_b] / E[psi_a]
    │   ├── DoubleMLPLRScalar
    │   └── DoubleMLIRMScalar
    └── NonLinearScoreMixin      # (planned) numerical root-finding

Plus a parallel multi-treatment track:

DoubleMLVector                   # multi-treatment base
└── DoubleMLPLRVector            # exact equivalence with legacy DoubleMLPLR for k>1 treatments

See doc/diagrams/architecture.md for the full UML and method-resolution diagrams.
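
The closed-form solve noted next to LinearScoreMixin can be illustrated with a small NumPy sketch. For a score that is linear in theta, psi = psi_a * theta + psi_b, setting the sample mean to zero gives theta = -mean(psi_b) / mean(psi_a). The function name `solve_linear_score`, the standard-error formula, and the simulated residuals are illustrative assumptions and are not taken from the PR's code.

```python
import numpy as np


def solve_linear_score(psi_a: np.ndarray, psi_b: np.ndarray) -> tuple[float, float]:
    """Solve mean(psi_a * theta + psi_b) = 0; return (theta, standard error)."""
    n = psi_a.shape[0]
    theta = -np.mean(psi_b) / np.mean(psi_a)
    psi = psi_a * theta + psi_b
    # Sandwich-type variance estimate: E[psi^2] / (E[psi_a]^2 * n)
    var = np.mean(psi**2) / (np.mean(psi_a) ** 2) / n
    return float(theta), float(np.sqrt(var))


rng = np.random.default_rng(0)
n = 2000
d_res = rng.normal(size=n)                 # hypothetical residual D - m(X)
y_res = 0.5 * d_res + rng.normal(size=n)   # hypothetical residual Y - l(X), true effect 0.5

# Partialling-out style score elements for a partially linear model
psi_a = -d_res * d_res
psi_b = d_res * y_res

theta_hat, se_hat = solve_linear_score(psi_a, psi_b)
```

With these score elements the solve reduces to a no-intercept regression of `y_res` on `d_res`, which is a convenient sanity check for any closed-form mixin.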

Key Design Decisions

Learners optional in constructor — __init__ accepts learners as optional kwargs (e.g. ml_l, ml_m, ml_g) for one-line construction, but they can also be configured (or replaced) later via set_learners(...). Decoupling the two paths makes it possible to swap learners, re-tune, or re-fit without rebuilding the model.
_learner_names as single source of truth — drives prediction-dict initialization and learner-availability checks; subclasses just declare the list.
Resampling separated from constructor — draw_sample_splitting() is its own step and can be called independently or re-drawn.
Template method fit() — orchestrates draw_sample_splitting() → fit_nuisance_models() → estimate_causal_parameters(). Subclasses implement _nuisance_est() and _get_score_elements(); the mixin provides _est_causal_pars_and_se().
External predictions — passed to fit() / fit_nuisance_models(), validated against _learner_names, and pre-filled before the cross-fitting loop.
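
The design decisions above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the PR's implementation: `ScalarSketch`, `PLRSketch`, and `MeanLearner` are hypothetical names, the learner is a trivial stand-in, and only the method names `set_learners`, `draw_sample_splitting`, `fit_nuisance_models`, `estimate_causal_parameters`, `_nuisance_est`, `_get_score_elements`, and the `_learner_names` attribute are borrowed from the description.

```python
from abc import ABC, abstractmethod

import numpy as np


class MeanLearner:
    """Stand-in learner that predicts the training mean (illustrative only)."""

    def fit(self, x, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x):
        return np.full(x.shape[0], self.mean_)


class ScalarSketch(ABC):
    """Hypothetical single-parameter orchestrator, not the PR's DoubleMLScalar."""

    _learner_names: list[str] = []  # subclasses declare their nuisance learners

    def __init__(self, x, y, d, n_folds=2, **learners):
        self.x, self.y, self.d, self.n_folds = x, y, d, n_folds
        self._learners = {}
        self.smpls = None
        if learners:  # learners are optional at construction time
            self.set_learners(**learners)

    def set_learners(self, **learners):
        unknown = set(learners) - set(self._learner_names)
        if unknown:
            raise ValueError(f"Unknown learners: {sorted(unknown)}")
        self._learners.update(learners)
        return self

    def draw_sample_splitting(self, seed=0):
        idx = np.random.default_rng(seed).permutation(len(self.y))
        folds = np.array_split(idx, self.n_folds)
        self.smpls = [(np.setdiff1d(idx, test), test) for test in folds]
        return self

    def fit_nuisance_models(self):
        missing = [name for name in self._learner_names if name not in self._learners]
        if missing:
            raise ValueError(f"Learners not set: {missing}")
        # _learner_names drives the prediction-dict initialization
        self.predictions = {name: np.full(len(self.y), np.nan) for name in self._learner_names}
        for train, test in self.smpls:
            self._nuisance_est(train, test)
        return self

    def estimate_causal_parameters(self):
        psi_a, psi_b = self._get_score_elements()
        self.coef = float(-np.mean(psi_b) / np.mean(psi_a))  # linear-score solve
        return self

    def fit(self):
        # Template method: splitting -> nuisance fits -> causal parameter
        if self.smpls is None:
            self.draw_sample_splitting()
        return self.fit_nuisance_models().estimate_causal_parameters()

    @abstractmethod
    def _nuisance_est(self, train, test): ...

    @abstractmethod
    def _get_score_elements(self): ...


class PLRSketch(ScalarSketch):
    _learner_names = ["ml_l", "ml_m"]

    def _nuisance_est(self, train, test):
        for name, target in {"ml_l": self.y, "ml_m": self.d}.items():
            model = self._learners[name].fit(self.x[train], target[train])
            self.predictions[name][test] = model.predict(self.x[test])

    def _get_score_elements(self):
        d_res = self.d - self.predictions["ml_m"]
        y_res = self.y - self.predictions["ml_l"]
        return -d_res * d_res, d_res * y_res


rng = np.random.default_rng(1)
x = rng.normal(size=(500, 3))
d = rng.normal(size=500)
y = 0.5 * d + rng.normal(size=500)

model = PLRSketch(x, y, d).set_learners(ml_l=MeanLearner(), ml_m=MeanLearner())
model.fit()
```

Because learners and sample splits live outside the constructor in this sketch, calling `set_learners(...)` or `draw_sample_splitting(seed=...)` again followed by `fit()` re-estimates the model without rebuilding it, which is the decoupling the description argues for.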

What's Included

Core infrastructure

doubleml/double_ml_base.py — abstract base with shared properties and inference delegation
doubleml/double_ml_scalar.py — single-parameter orchestrator (~1.4k LOC)
doubleml/double_ml_linear_score.py — LinearScoreMixin
doubleml/double_ml_vector.py — multi-treatment base class (first iteration)

Scalar models

doubleml/plm/plr_scalar.py — DoubleMLPLRScalar with cate(), gate(), _partial_out()
doubleml/irm/irm_scalar.py — DoubleMLIRMScalar with cate(), gate(), weighted scores (array + dict-with-weights_bar)

Vector models

doubleml/plm/plr_vector.py — DoubleMLPLRVector, validated against legacy DoubleMLPLR for multi-treatment

Cross-cutting features

Optuna tuning — tune_ml_models() on DoubleMLScalar with pruning support, _LEARNER_PARAM_ALIASES (e.g. IRM ml_g → [ml_g0, ml_g1]), and a _get_tuning_data() hook for subclass-specific tuning targets
Nuisance evaluation — nuisance_targets, nuisance_loss, and evaluate_learners(metric=...) with auto-defaulted RMSE / log-loss and NaN-aware masking
Sensitivity analysis — vectorized _sensitivity_element_est() hook running over all reps post-fit, with framework-ready shapes; supports the full sensitivity_analysis() pipeline
DoubleMLBLP per-rep basis — basis may be a single pd.DataFrame (shared) or list[pd.DataFrame] of length n_rep. Also fixes a multi-rep / multi-column bug in legacy DoubleMLPLR.cate() (doubleml/utils/blp.py)
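
The NaN-aware masking mentioned for nuisance evaluation could look roughly like the following sketch. It assumes that a learner's targets are NaN for observations it is not evaluated on (for example, a control-group outcome model on treated units); the helper names `masked_rmse` and `masked_log_loss` are hypothetical and do not come from the PR.

```python
import numpy as np


def masked_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """RMSE over entries where both target and prediction are finite."""
    mask = np.isfinite(y_true) & np.isfinite(y_pred)
    if not mask.any():
        return float("nan")
    return float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))


def masked_log_loss(y_true: np.ndarray, p_pred: np.ndarray, eps: float = 1e-15) -> float:
    """Binary log-loss over finite entries, with probabilities clipped away from 0 and 1."""
    mask = np.isfinite(y_true) & np.isfinite(p_pred)
    if not mask.any():
        return float("nan")
    y = y_true[mask]
    p = np.clip(p_pred[mask], eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


# Regression-style learner: two targets are masked out as NaN
targets = np.array([1.0, np.nan, 3.0, np.nan])
preds = np.array([1.5, 2.0, 2.5, 4.0])
rmse = masked_rmse(targets, preds)

# Classification-style learner: the last target is masked out as NaN
labels = np.array([1.0, 0.0, np.nan])
probs = np.array([0.5, 0.5, 0.9])
ll = masked_log_loss(labels, probs)
```

Defaulting to RMSE for regressors and log-loss for classifiers then only requires choosing between the two helpers based on the learner type.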

Test suites

Every scalar model ships with the mandatory 5-file structure plus dedicated files for tuning, evaluation, and sensitivity:

test_<model>_scalar.py — 3-sigma estimation accuracy
test_<model>_scalar_return_types.py — property types/shapes
test_<model>_scalar_exceptions.py — input validation
test_<model>_scalar_vs_<model>.py — exact match with legacy (rtol=1e-9)
test_<model>_scalar_external_predictions.py — external-prediction equivalence
test_<model>_scalar_tune_ml_models.py — Optuna tuning
test_<model>_scalar_evaluate_learners.py — nuisance loss / metrics
test_<model>_scalar_sensitivity.py — sensitivity bounds & monotonicity
test_<model>_scalar_cate_gate.py — CATE/GATE (PLR & IRM)

PLR vector ships with 5 corresponding test files. In addition, shared scalar-level tests live in doubleml/tests/ (cluster splits, fit, set-sample-splitting, tune-pruning, tune-exceptions, ext-predictions).
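
The two acceptance criteria named in the file list (3-sigma accuracy and a legacy match at rtol=1e-9) can be expressed with small helpers like the ones below. This is a sketch only: the helper names and the numeric values are hypothetical, and `atol=0.0` is an assumption, since the description states only the relative tolerance.

```python
import numpy as np


def within_three_sigma(coef: float, se: float, true_value: float) -> bool:
    """Estimation-accuracy check: the estimate lies within 3 standard errors of the truth."""
    return abs(coef - true_value) <= 3.0 * se


def assert_matches_legacy(new: np.ndarray, legacy: np.ndarray) -> None:
    """Equivalence check against legacy results at a relative tolerance of 1e-9."""
    np.testing.assert_allclose(new, legacy, rtol=1e-9, atol=0.0)


# Hypothetical values standing in for fitted coefficients
ok = within_three_sigma(coef=0.52, se=0.03, true_value=0.5)
too_far = within_three_sigma(coef=0.70, se=0.03, true_value=0.5)

# A difference at the level of floating-point noise passes the legacy comparison
assert_matches_legacy(np.array([1.0 + 1e-12]), np.array([1.0]))
```

A comparison at rtol=1e-9 only holds if both implementations see identical sample splits and identically seeded learners, so such tests typically share the splits between the new and legacy objects.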

Tooling & docs

.claude/CLAUDE.md, .claude/STATUS.md — project guidance & branch status
.claude/rules/ — code conventions, error handling, performance, testing, scalar test structure
.claude/agents/, .claude/skills/ — reviewer agents and skills
.github/copilot-instructions.md + .github/copilot/ — mirrored guidance for Copilot
doc/diagrams/architecture.md, doc/diagrams/testing_structure.md

Feature Parity with Legacy Classes

| Feature | Status |
| --- | --- |
| `cate()` / `gate()` (PLR + IRM) | ported |
| `_partial_out()` (PLR) | ported |
| Array weights (IRM) | supported |
| Dict weights with `weights_bar` (IRM) | supported, validated via `_check_smpls_dependent_inputs()` hook |
| `policy_tree()` (IRM) | not yet ported |
| Callable score | intentionally not ported (design decision) |
| `trimming_rule` / `trimming_threshold` deprecated props | replaced by `ps_processor_config` |

Backwards Compatibility

All legacy classes (DoubleMLPLR, DoubleMLIRM, …) remain unchanged and pass their existing test suites.
The new hierarchy is additive — exported alongside the legacy API in doubleml/__init__.py.
A latent multi-rep / multi-column bug in legacy DoubleMLPLR.cate() (basis * D_tilde mis-broadcast for n_rep > 1 and d_basis > 1) is fixed via the new BLP per-rep API.

Test Plan

pytest -m ci passes locally
pytest doubleml/plm/tests/ doubleml/irm/tests/ -v — full module suites for both refactored and legacy classes
pytest doubleml/tests/test_scalar_*.py -v — shared scalar infrastructure tests
black ., ruff check ., and mypy doubleml run clean, apart from pre-existing mypy errors that this branch does not introduce
Spot-check summary, confint, bootstrap, and sensitivity_analysis() on a fitted DoubleMLPLRScalar and DoubleMLIRMScalar
Spot-check exact-match tests (*_scalar_vs_*.py) at rtol=1e-9

Follow-ups (out of scope)

DoubleMLIRMVector
DoubleMLPLIVScalar, DoubleMLPLPRScalar
DID scalar variants (DID, DIDCSBinary, DIDMulti)
DoubleMLVector base-class tests
IRM policy_tree() port

- Introduced learner management in DoubleMLScalar with properties for learner names and instances.
- Added abstract method `set_learners` to enforce learner setting in subclasses.
- Updated PLR to utilize the new learner management system, including validation checks for learner instances.
- Refactored tests to align with the new learner management approach, ensuring proper exception handling and validation.

- Implemented the IRM class for double machine learning with interactive regression models in irm_scalar.py.
- Added core estimation tests for IRM scalar in test_irm_scalar.py.
- Created exception handling tests for IRM scalar in test_irm_scalar_exceptions.py.
- Developed tests for handling external predictions in test_irm_scalar_external_predictions.py.
- Added return type validation tests for IRM scalar in test_irm_scalar_return_types.py.
- Compared the new IRM scalar implementation against the existing DoubleMLIRM in test_irm_scalar_vs_irm.py.

- Added `_sensitivity_element_est` method to `DoubleMLScalar`, `IRM`, and `PLR` classes to compute sensitivity elements including sigma2, nu2, and their influence functions.
- Introduced `sensitivity_elements` property to retrieve computed sensitivity elements after model fitting.
- Implemented validation checks for sensitivity elements in `DoubleMLScalar`.
- Added exception handling for sensitivity analysis methods in `IRM` and `PLR` classes to ensure proper input types and values.
- Created unit tests for sensitivity analysis, including checks for element shapes, bounds, and exception handling in both `IRM` and `PLR` models.
- Ensured compatibility of sensitivity elements between scalar and legacy models in comparison tests.

- Implemented `cate()` and `gate()` methods in `IRM` and `PLR` classes for estimating conditional average treatment effects.
- Enhanced `DoubleMLBLP` to support per-rep basis for multi-rep scenarios.
- Updated tests for `IRM` and `PLR` to validate new functionality, including checks for correct handling of multi-rep bases and group effects.
- Improved validation of basis inputs in `DoubleMLBLP` to accept both single DataFrame and list of DataFrames.
- Added new test cases to ensure robustness of the new features and backward compatibility with legacy models.

JanTeichertKluge

Thanks for the comprehensive refactoring. I think the changes are beneficial for the package as a whole. The multi-level hierarchy (DoubleMLBase → DoubleMLScalar → LinearScoreMixin → PLR/IRM) is very logical and clear. The centralization of _LEARNER_SPECS / validate_learner is also a particularly successful improvement over the previous _check_learner calls.
I have noted a few minor points in the comments.

JanTeichertKluge · 2026-05-18T11:32:46Z

+            ml_l_info = self._learners["ml_l"]
+            self._learners["ml_g"] = LearnerInfo(
+                learner=clone(ml_l_info.learner),
+                is_classifier=ml_l_info.is_classifier,


Aren't we effectively skipping the validation step here? For example, ml_g has allow_classifier=False, but if ml_l is a classifier (which would be valid?), the clone will inherit is_classifier=True.

JanTeichertKluge · 2026-05-18T11:34:51Z

+    def __init__(
+        self,
+        obj_dml_data: DoubleMLBaseData,
+        score: str = "default",


Not sure about this. See comment on super().__init__()

JanTeichertKluge · 2026-05-18T11:36:48Z

+            )
+
+        # Call parent constructor
+        super().__init__(obj_dml_data)


We call super after setting score = "default".

JanTeichertKluge · 2026-05-18T11:38:02Z

+        # Call parent constructor
+        super().__init__(obj_dml_data)
+
+        self._score = score


And we overwrite the score here. I think the score is inherited by every child class model (PLR, IRM, etc.) and set to that model's specific default.

SvenKlaassen · 2026-05-18T12:32:55Z

+        if has_l and not has_g:
+            warnings.warn("For score='IV-type', ml_g not set. Cloning ml_l to ml_g.")
+            # Clone the learner and register with same info
+            from ..utils._learner import LearnerInfo


This import should also be moved

SvenKlaassen added 29 commits January 31, 2026 09:58

first iteration of scalar implementation

a4b880c

refactor DoubleMLScalar to split fit() into separate parts

4f4c255

add plr_scalar implementation

ae2e5be

fix external predictions for doublemlscalar

dad5e4c

Add architecture documentation for DoubleMLScalar and class hierarchy

384beba

Add code simplifier and technical debt finder documentation

838d0ca

Enhance DoubleMLScalar and PLR with learner management, validation, a…

54e9eb4

…nd utility functions

Refactor documentation and guidelines for DoubleML, including coding …

0947c9d

…standards, error handling, performance guidelines, and testing conventions.

Refactor IRM class type hints to use built-in types and improve code …

48ae3a9

…clarity

Refactor DoubleMLScalar to enhance sample splitting functionality; ad…

1886a7d

…d tests for cluster-based sample splitting and external prediction validation.

Refactor IRM and PLR classes to reset fit state after updating learne…

0ca053d

…rs; enhance tests for return types and reset behavior.

Add copilot documentation for code style, error handling, performance…

33c8b01

…, testing, and scalar model test structure

Enhance DoubleMLScalar and IRM classes for stratified sample splittin…

45c5f48

…g; update tests for consistency

add post_nuisance checks

0051c77

Merge branch 'main' into sk-refactoring

3486f5a

add guideline for using absolute imports from project root

35434bb

add guidelines for tuning tests and required fixtures for scalar models

d195fff

Enhance DoubleMLScalar with improved tuning functionality and tests

15216f0

add nuisance evalutaion

b0026da

add first dml vector class

e980cca

Add branch status and TODOs documentation for sk-refactoring

17cf8f3

Refactor weight handling in IRM and add comprehensive exception tests…

3818c2b

… for weights

Merge branch 'main' into sk-refactoring

fdcf936

refactor: enhance validation for weights_bar in IRM and update fit ha…

82d95a5

…ndling in DoubleMLScalar

feat: Implement PLRVector for multi-treatment partially linear regres…

71ef483

…sion and add comprehensive tests

SvenKlaassen requested a review from JanTeichertKluge May 9, 2026 13:33

github-advanced-security AI found potential problems May 9, 2026

Comment thread doubleml/irm/tests/test_irm_scalar_exceptions.py Fixed

SvenKlaassen added 11 commits May 9, 2026 16:44

refactor: move Self type hint import to typing_extensions for 3.10

1ae721c

Fix high priority codacy issues: update set_learners method signature…

f1c0bcd

… and enhance error handling in PLR and LearnerSpec validation

fix medium codacy issues: streamline learner validation by extracting…

d74e9f9

… checks into dedicated functions

docs: fix docstring lint on new scalar/vector implementations

4c0fe2a

Apply ruff D200/D213/D413 auto-fixes and add __init__ docstrings to DoubleMLVector and PLRVector.

refactor: simplify set_learners method signature by removing kwargs

b996235

refactor: remove redundant pass statements in abstract methods and st…

39a0101

…reamline sample comparison logic in tests

refactor: remove redundant pass statement in DoubleMLScalar class

9913dbf

refactor: remove redundant pass statements in abstract methods of Dou…

bd75efb

…bleMLScalar class

refactor: simplify docstring for set_learners method in PLRVector class

93247fd

refactor: add doctest skip directive to evaluate_learners examples in…

d56d105

… DoubleMLScalar class

refactor: enhance basis validation in DoubleMLPLR and PLR classes

3ffa823

JanTeichertKluge approved these changes May 18, 2026

SvenKlaassen commented May 18, 2026

SvenKlaassen wants to merge 40 commits into main from sk-refactoring
